Similarity-Based Estimation of Word
Cooccurrence Probabilities*

Ido Dagan 
Fernando Pereira 
AT&T Bell Laboratories 
600 Mountain Ave., Murray Hill, NJ 07974, USA 
dagan@research.att.com
pereira@research.att.com
Lillian Lee 

Division of Applied Sciences, Harvard University 
33 Oxford St. Cambridge MA 02138, USA 
llee@das.harvard.edu

February 5, 2008 



Abstract 

In many applications of natural language processing it is necessary to 
determine the likelihood of a given word combination. For example, a 
speech recognizer may need to determine which of the two word combi- 
nations "eat a peach" and "eat a beach" is more likely. Statistical NLP 
methods determine the likelihood of a word combination according to its 
frequency in a training corpus. However, the nature of language is such 
that many word combinations are infrequent and do not occur in a given 
corpus. In this work we propose a method for estimating the probability 
of such previously unseen word combinations using available information 
on "most similar" words. 

We describe a probabilistic word association model based on distribu- 
tional word similarity, and apply it to improving probability estimates 
for unseen word bigrams in a variant of Katz's back-off model. The 
similarity-based method yields a 20% perplexity improvement in the pre- 
diction of unseen bigrams and statistically significant reductions in speech- 
recognition error. 



*To appear in the proceedings of the 32nd Annual Meeting of the Association for
Computational Linguistics, New Mexico State University, June 1994.

1 Introduction



Data sparseness is an inherent problem in statistical methods for natural lan- 
guage processing. Such methods use statistics on the relative frequencies of 
configurations of elements in a training corpus to evaluate alternative analyses 
or interpretations of new samples of text or speech. The most likely analysis 
will be taken to be the one that contains the most frequent configurations. The 
problem of data sparseness arises when analyses contain configurations that 
never occurred in the training corpus. Then it is not possible to estimate prob- 
abilities from observed frequencies, and some other estimation scheme has to be 
used. 

We focus here on a particular kind of configuration, word cooccurrence. Ex- 
amples of such cooccurrences include relationships between head words in syn- 
tactic constructions (verb-object or adjective-noun, for example) and word se- 
quences (n-grams). In commonly used models, the probability estimate for a 
previously unseen cooccurrence is a function of the probability estimates for the 
words in the cooccurrence. For example, in the bigram models that we study
here, the probability $P(w_2|w_1)$ of a conditioned word $w_2$ that has never
occurred in training following the conditioning word $w_1$ is calculated from
the probability of $w_2$, as estimated by $w_2$'s frequency in the corpus
(Jelinek, Mercer, and Roukos, 1992; Katz, 1987). This method depends on an
independence assumption on the cooccurrence of $w_1$ and $w_2$: the more
frequent $w_2$ is, the higher will be the estimate of $P(w_2|w_1)$, regardless
of $w_1$.

Class-based and similarity-based models provide an alternative to the inde- 
pendence assumption. In those models, the relationship between given words 
is modeled by analogy with other words that are in some sense similar to the 
given ones. 

Brown et al. (1992) suggest a class-based n-gram model in which words
with similar cooccurrence distributions are clustered in word classes. The
cooccurrence probability of a given pair of words then is estimated according
to an averaged cooccurrence probability of the two corresponding classes.
Pereira, Tishby, and Lee (1993) propose a "soft" clustering scheme for certain
grammatical cooccurrences in which membership of a word in a class is
probabilistic. Cooccurrence probabilities of words are then modeled by averaged
cooccurrence probabilities of word clusters.

Dagan, Markus, and Markovitch (1993) argue that reduction to a relatively 
small number of predetermined word classes or clusters may cause a substantial 
loss of information. Their similarity-based model avoids clustering altogether. 
Instead, each word is modeled by its own specific class, a set of words which are 
most similar to it (as in k-nearest neighbor approaches in pattern recognition). 
Using this scheme, they predict which unobserved cooccurrences are more likely 
than others. Their model, however, is not probabilistic, that is, it does not 
provide a probability estimate for unobserved cooccurrences. It cannot
therefore be used in a complete probabilistic framework, such as n-gram
language models or probabilistic lexicalized grammars (Schabes, 1992; Lafferty,
Sleator, and Temperley, 1992).



We now give a similarity-based method for estimating the probabilities of 
cooccurrences unseen in training. Similarity-based estimation was first used for
language modeling in the cooccurrence smoothing method of Essen and Steinbiss
(1992), derived from work on acoustic model smoothing by Sugawara et al.
(1985). We present a different method that takes as its starting point the
back-off scheme of Katz (1987). We first allocate an appropriate probability mass
for unseen cooccurrences following the back-off method. Then we redistribute 
that mass to unseen cooccurrences according to an averaged cooccurrence dis- 
tribution of a set of most similar conditioning words, using relative entropy as 
our similarity measure. This second step replaces the use of the independence 
assumption in the original back-off model. 

We applied our method to estimate unseen bigram probabilities for Wall 
Street Journal text and compared it to the standard back-off model. Testing on 
a held-out sample, the similarity model achieved a 20% reduction in perplexity 
for unseen bigrams. These constituted just 10.6% of the test sample, leading 
to an overall reduction in test-set perplexity of 2.4%. We also experimented 
with an application to language modeling for speech recognition, which yielded 
a statistically significant reduction in recognition error. 

The remainder of the discussion is presented in terms of bigrams, but it is 
valid for other types of word cooccurrence as well. 



2 Discounting and Redistribution 



Many low-probability bigrams will be missing from any finite sample. Yet, the 
aggregate probability of all these unseen bigrams is fairly high; any new sample 
is very likely to contain some. 

Because of data sparseness, we cannot reliably use a maximum likelihood 
estimator (MLE) for bigram probabilities. The MLE for the probability of a 
bigram $(w_1, w_2)$ is simply:

$$P_{ML}(w_1, w_2) = \frac{c(w_1, w_2)}{N} \tag{1}$$

where $c(w_1, w_2)$ is the frequency of $(w_1, w_2)$ in the training corpus and
$N$ is the total number of bigrams. However, this estimates the probability of
any unseen bigram to be zero, which is clearly undesirable.
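
The zero-probability problem is easy to reproduce. The following is a minimal
sketch, not part of the original method: the toy corpus and the helper name
`mle_bigram_probs` are illustrative assumptions.

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """Return a function computing the joint MLE c(w1, w2) / N."""
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    n = len(bigrams)  # N: total number of bigram tokens in the sample

    def p_ml(w1, w2):
        # Counter returns 0 for a bigram that never occurred.
        return counts[(w1, w2)] / n

    return p_ml

# Toy corpus (hypothetical, for illustration only).
tokens = "eat a peach eat a pear see a beach".split()
p_ml = mle_bigram_probs(tokens)

print(p_ml("a", "peach"))    # seen bigram: nonzero estimate
print(p_ml("eat", "beach"))  # unseen bigram: the MLE is zero
```

Any bigram absent from the sample receives probability zero, however plausible
it is, which is the behavior the adjusted estimators discussed next are meant
to avoid.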



Previous proposals to circumvent the above problem (Good, 1953; Jelinek,
Mercer, and Roukos, 1992; Katz, 1987; Church and Gale, 1991) take the MLE
as an initial estimate and adjust it so that the total probability of seen bigrams
is less than one, leaving some probability mass for unseen bigrams. Typically,
the adjustment involves either interpolation, in which the new estimator is a
weighted combination of the MLE and an estimator that is guaranteed to be
nonzero for unseen bigrams, or discounting, in which the MLE is decreased
according to a model of the unreliability of small frequency counts, leaving
some probability mass for unseen bigrams.



The back-off model of Katz (1987) provides a clear separation between fre- 
quent events, for which observed frequencies are reliable probability estimators, 
and low-frequency events, whose prediction must involve additional information 
sources. In addition, the back-off model does not require complex estimations 
for interpolation parameters. 

A back-off model requires methods for (a) discounting the estimates of pre- 
viously observed events to leave out some positive probability mass for unseen 
events, and (b) redistributing among the unseen events the probability mass 
freed by discounting. For bigrams the resulting estimator has the general form 

$$\hat{P}(w_2|w_1) = \begin{cases} P_d(w_2|w_1) & \text{if } c(w_1, w_2) > 0 \\ \alpha(w_1) P_r(w_2|w_1) & \text{otherwise} \end{cases} \tag{2}$$

where $P_d$ represents the discounted estimate for seen bigrams, $P_r$ the model
for probability redistribution among the unseen bigrams, and $\alpha(w)$ is a
normalization factor. Since the overall mass left for unseen bigrams starting
with $w_1$ is given by

$$\tilde{\beta}(w_1) = 1 - \sum_{w_2 : c(w_1, w_2) > 0} P_d(w_2|w_1),$$

the normalization factor required to ensure $\sum_{w_2} \hat{P}(w_2|w_1) = 1$ is

$$\alpha(w_1) = \frac{\tilde{\beta}(w_1)}{\sum_{w_2 : c(w_1, w_2) = 0} P_r(w_2|w_1)} = \frac{\tilde{\beta}(w_1)}{1 - \sum_{w_2 : c(w_1, w_2) > 0} P_r(w_2|w_1)}.$$

The second formulation of the normalization is computationally preferable
because the total number of possible bigram types far exceeds the number of
observed types. Equation (2) modifies slightly Katz's presentation to include
the placeholder $P_r$ for alternative models of the distribution of unseen
bigrams.
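
The general back-off form and its normalization factor can be sketched as
follows. This is a minimal illustration under stated assumptions: the function
and variable names are hypothetical, the discounted estimates are toy numbers
supplied by hand, and a unigram distribution stands in for the redistribution
model.

```python
def backoff_prob(w2, w1, p_d, p_r, seen_after):
    """Back-off estimate: P_d for seen bigrams, alpha(w1) * P_r otherwise."""
    seen = seen_after.get(w1, set())
    if w2 in seen:
        return p_d[(w1, w2)]
    # Probability mass left over for unseen bigrams that start with w1.
    free_mass = 1.0 - sum(p_d[(w1, v)] for v in seen)
    # Second form of the normalization: sum P_r over seen followers only.
    alpha = free_mass / (1.0 - sum(p_r(v, w1) for v in seen))
    return alpha * p_r(w2, w1)

# Toy numbers (hypothetical): discounted estimates for two seen bigrams.
vocab = ["peach", "pear", "beach", "bench"]
p_d = {("eat", "peach"): 0.5, ("eat", "pear"): 0.3}
seen_after = {"eat": {"peach", "pear"}}
unigram = {"peach": 0.4, "pear": 0.2, "beach": 0.3, "bench": 0.1}

def unigram_p_r(w2, w1):
    # One possible redistribution model: ignore w1 and use P(w2).
    return unigram[w2]

total = sum(backoff_prob(v, "eat", p_d, unigram_p_r, seen_after) for v in vocab)
print(total)  # close to 1.0, up to floating-point rounding
```

Only the followers actually seen after the conditioning word are enumerated
when computing the normalization, which is why the second form is the cheaper
one.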

Katz uses the Good-Turing formula to replace the actual frequency
$c(w_1, w_2)$ of a bigram (or an event, in general) with a discounted
frequency, $c^*(w_1, w_2)$, defined by

$$c^*(w_1, w_2) = (c(w_1, w_2) + 1) \frac{n_{c(w_1, w_2)+1}}{n_{c(w_1, w_2)}}, \tag{3}$$

where $n_c$ is the number of different bigrams in the corpus that have
frequency $c$. He then uses the discounted frequency in the conditional
probability calculation for a bigram:

$$P_d(w_2|w_1) = \frac{c^*(w_1, w_2)}{c(w_1)}. \tag{4}$$

In the original Good-Turing method (Good, 1953) the free probability mass
is redistributed uniformly among all unseen events. Instead, Katz's back-off
scheme redistributes the free probability mass non-uniformly in proportion to
the frequency of $w_2$, by setting

$$P_r(w_2|w_1) = P(w_2). \tag{5}$$

Katz thus assumes that for a given conditioning word $w_1$ the probability of an
unseen following word $w_2$ is proportional to its unconditional probability.
However, the overall form of the model (2) does not depend on this assumption,
and we will next investigate an estimate for $P_r(w_2|w_1)$ derived by averaging
estimates for the conditional probabilities that $w_2$ follows words that are
distributionally similar to $w_1$.
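
The Good-Turing discount (3) and the discounted conditional estimate (4) can be
sketched as below. This is an illustration only: the counts are invented, the
count of a conditioning word is approximated by summing the bigram counts that
start with it, and the fallback used when no bigram has frequency c + 1 is an
assumption of this sketch rather than something specified in the text.

```python
from collections import Counter

def good_turing_counts(bigram_counts):
    """Map each seen bigram to c* = (c + 1) * n_{c+1} / n_c."""
    n = Counter(bigram_counts.values())  # n[c]: bigram types with frequency c
    discounted = {}
    for bigram, c in bigram_counts.items():
        if n[c + 1] > 0:
            discounted[bigram] = (c + 1) * n[c + 1] / n[c]
        else:
            # Assumption: keep the raw count when n_{c+1} is zero.
            discounted[bigram] = float(c)
    return discounted

def discounted_conditional(bigram_counts):
    """P_d(w2|w1) = c*(w1, w2) / c(w1) for every seen bigram."""
    c_star = good_turing_counts(bigram_counts)
    c_w1 = Counter()
    for (w1, _), c in bigram_counts.items():
        c_w1[w1] += c  # assumption: c(w1) taken as the sum over its bigrams
    return {(w1, w2): cs / c_w1[w1] for (w1, w2), cs in c_star.items()}

# Toy counts (hypothetical): six types seen once, two twice, one three times.
counts = {
    ("eat", "peach"): 2, ("eat", "pear"): 1, ("eat", "plum"): 1,
    ("see", "beach"): 1, ("see", "peach"): 1,
    ("a", "pear"): 2, ("a", "beach"): 3, ("a", "plum"): 1, ("a", "peach"): 1,
}
p_d = discounted_conditional(counts)
seen_mass = sum(p for (w1, _), p in p_d.items() if w1 == "eat")
print(p_d[("eat", "peach")], seen_mass)  # seen_mass < 1 leaves room for unseen
```

The mass missing from `seen_mass` is what the back-off scheme hands to the
redistribution model.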



3 The Similarity Model 

Our scheme is based on the assumption that words that are "similar" to $w_1$
can provide good predictions for the distribution of $w_1$ in unseen bigrams.
Let $S(w_1)$ denote a set of words which are most similar to $w_1$, as
determined by some similarity metric. We define $P_{SIM}(w_2|w_1)$, the
similarity-based model for the conditional distribution of $w_1$, as a weighted
average of the conditional distributions of the words in $S(w_1)$:

$$P_{SIM}(w_2|w_1) = \sum_{w_1' \in S(w_1)} P(w_2|w_1') \frac{W(w_1', w_1)}{\sum_{w_1'' \in S(w_1)} W(w_1'', w_1)}, \tag{6}$$

where $W(w_1', w_1)$ is the (unnormalized) weight given to $w_1'$, determined by
its degree of similarity to $w_1$. According to this scheme, $w_2$ is more
likely to follow $w_1$ if it tends to follow words that are most similar to
$w_1$. To complete the scheme, it is necessary to define the similarity metric
and, accordingly, $S(w_1)$ and $W(w_1', w_1)$.



Following Pereira, Tishby, and Lee (1993), we measure word similarity by the
relative entropy, or Kullback-Leibler (KL) distance, between the corresponding
conditional distributions

$$D(w_1 \| w_1') = \sum_{w_2} P(w_2|w_1) \log \frac{P(w_2|w_1)}{P(w_2|w_1')}. \tag{7}$$

The KL distance is 0 when $w_1 = w_1'$, and it increases as the two
distributions become less similar.
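
A minimal sketch of the KL distance between two conditional distributions
follows. The dictionaries and their numbers are hypothetical, and natural
logarithms are used here purely for illustration.

```python
import math

def kl_distance(p, q):
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)).

    p and q map each following word to a conditional probability. q must be
    nonzero wherever p is nonzero, otherwise the distance is undefined.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Toy conditional distributions P(w2 | w1) (hypothetical numbers).
eat = {"peach": 0.6, "pear": 0.3, "beach": 0.1}
devour = {"peach": 0.5, "pear": 0.4, "beach": 0.1}
visit = {"peach": 0.1, "pear": 0.1, "beach": 0.8}

print(kl_distance(eat, eat))     # 0.0 for identical distributions
print(kl_distance(eat, devour))  # small: the two words share contexts
print(kl_distance(eat, visit))   # larger: the contexts differ
```

The requirement that `q` be nonzero wherever `p` is nonzero is the reason
smoothed estimates are needed before distances can be computed.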

To compute (6) and (7) we must have nonzero estimates of $P(w_2|w_1)$
whenever necessary for (7) to be defined. We use the estimates given by the
standard back-off model, which satisfy that requirement. Thus our application of
the similarity model averages together standard back-off estimates for a set of
similar conditioning words.

We define $S(w_1)$ as the set of at most $k$ nearest words to $w_1$ (excluding
$w_1$ itself) that also satisfy $D(w_1 \| w_1') < t$. The parameters $k$ and $t$
control the contents of $S(w_1)$ and are tuned experimentally, as we will see
below.

$W(w_1', w_1)$ is defined as

$$W(w_1', w_1) = \exp(-\beta D(w_1 \| w_1')).$$

The weight is larger for words that are more similar (closer) to $w_1$. The
parameter $\beta$ controls the relative contribution of words at different
distances from $w_1$: as the value of $\beta$ increases, the nearest words to
$w_1$ get relatively more weight. As $\beta$ decreases, remote words have a
larger effect. Like $k$ and $t$, $\beta$ is tuned experimentally.
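
Putting the pieces together, the neighbor set, the weights and the weighted
average can be sketched as follows. This is an illustration under stated
assumptions, not the authors' implementation: all names and toy distributions
are hypothetical, base-10 logarithms and exponentials are used as stated in
footnote 3, and raising an error for an empty neighbor set is a choice of this
sketch.

```python
import math

def kl10(p, q):
    """KL distance D(p || q) with base-10 logarithms."""
    return sum(px * math.log10(px / q[x]) for x, px in p.items() if px > 0)

def similar_words(w1, cond, k, t):
    """S(w1): at most k nearest words with D(w1 || w1') < t, excluding w1."""
    dists = sorted((kl10(cond[w1], cond[v]), v) for v in cond if v != w1)
    return [(v, d) for d, v in dists if d < t][:k]

def p_sim(w2, w1, cond, k, t, beta):
    """Weighted average of P(w2 | w1') over the neighbors w1' of w1."""
    neighbors = similar_words(w1, cond, k, t)
    if not neighbors:
        # Assumption: the text does not say what to do in this case.
        raise ValueError("no sufficiently similar words for " + w1)
    weights = [(v, 10.0 ** (-beta * d)) for v, d in neighbors]
    total = sum(wt for _, wt in weights)
    return sum(cond[v][w2] * wt for v, wt in weights) / total

# Toy smoothed conditional distributions P(w2 | w1) (hypothetical numbers).
cond = {
    "eat": {"peach": 0.6, "pear": 0.3, "beach": 0.1},
    "devour": {"peach": 0.5, "pear": 0.4, "beach": 0.1},
    "visit": {"peach": 0.1, "pear": 0.1, "beach": 0.8},
}
for w2 in ("peach", "pear", "beach"):
    print(w2, p_sim(w2, "eat", cond, k=2, t=2.5, beta=4))
```

Because "devour" is much closer to "eat" than "visit" is, its distribution
dominates the average, and the cost per query grows with the number of
neighbors retained.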

Having a definition for $P_{SIM}(w_2|w_1)$, we could use it directly as
$P_r(w_2|w_1)$ in the back-off scheme (2). We found that it is better to smooth
$P_{SIM}(w_2|w_1)$ by interpolating it with the unigram probability $P(w_2)$
(recall that Katz used $P(w_2)$ as $P_r(w_2|w_1)$). Using linear interpolation
we get

$$P_r(w_2|w_1) = \gamma P(w_2) + (1 - \gamma) P_{SIM}(w_2|w_1), \tag{8}$$

where $\gamma$ is an experimentally-determined interpolation parameter. This
smoothing appears to compensate for inaccuracies in $P_{SIM}(w_2|w_1)$, mainly
for infrequent conditioning words. However, as the evaluation below shows, good
values for $\gamma$ are small, that is, the similarity-based model plays a
stronger role than the independence assumption.
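
The interpolation in (8) amounts to a one-line function. In this sketch the
input probabilities are hypothetical, and the value 0.15 for the interpolation
parameter is the one reported in the evaluation section.

```python
def redistribution_prob(p_unigram_w2, p_sim_w2_given_w1, gamma):
    """P_r(w2|w1) = gamma * P(w2) + (1 - gamma) * P_SIM(w2|w1)."""
    return gamma * p_unigram_w2 + (1.0 - gamma) * p_sim_w2_given_w1

# Hypothetical inputs: P(w2) = 0.2 and P_SIM(w2|w1) = 0.4.
print(redistribution_prob(0.2, 0.4, gamma=0.15))  # mostly the similarity model
print(redistribution_prob(0.2, 0.4, gamma=1.0))   # reduces to Katz's P(w2)
```

Setting the parameter to 1 recovers Katz's original redistribution model, so
the standard back-off model is a special case of the interpolated one.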

To summarize, we construct a similarity-based model for $P(w_2|w_1)$ and then
interpolate it with $P(w_2)$. The interpolated model (8) is used in the back-off
scheme as $P_r(w_2|w_1)$, to obtain better estimates for unseen bigrams. Four
parameters, to be tuned experimentally, are relevant for this process: $k$ and
$t$, which determine the set of similar words to be considered, $\beta$, which
determines the relative effect of these words, and $\gamma$, which determines
the overall importance of the similarity-based model.



4 Evaluation 



We evaluated our method by comparing its perplexity¹ and effect on
speech-recognition accuracy with the baseline bigram back-off model developed by
MIT Lincoln Laboratories for the Wall Street Journal (WSJ) text and dictation
corpora provided by ARPA's HLT program (Paul, 1991).² The baseline back-off
model follows closely the Katz design, except that for compactness all frequency
one bigrams are ignored. The counts used in this model and in ours were obtained
from 40.5 million words of WSJ text from the years 1987-89.

1 The perplexity of a conditional bigram probability model $P$ with respect to
the true bigram distribution is an information-theoretic measure of model
quality (Jelinek, Mercer, and Roukos, 1992) that can be empirically estimated by
$\exp\left(-\frac{1}{N} \sum_i \log P(w_i|w_{i-1})\right)$ for a test set of
length $N$. Intuitively, the lower the perplexity of a model the more likely the
model is to assign high probability to bigrams that actually occur. In our task,
lower perplexity will indicate better prediction of unseen bigrams.

2 The ARPA WSJ development corpora come in two versions, one with verbalized
punctuation and the other without. We used the latter in all our experiments.

k    t    β    γ      training reduction (%)   test reduction (%)
…    …    4    0.15   18.4                     20.51
…    …    4    0.15   18.38                    20.45
…    …    4    0.2    18.34                    20.03
…    …    4    0.25   18.33                    19.76
…    …    4    0.1    18.3                     20.53
…    …    …    0.1    18.25                    20.55
…    …    …    0.1    18.23                    20.54
…    …    …    0.1    18.23                    20.59
…    …    …    0.3    18.04                    18.7
…    …    …    0.3    16.64                    16.94

Table 1: Perplexity Reduction on Unseen Bigrams for Different Model Parameters

For perplexity evaluation, we tuned the similarity model parameters by min- 
imizing perplexity on an additional sample of 57.5 thousand words of WSJ text, 
drawn from the ARPA HLT development test set. The best parameter values
found were $k = 60$, $t = 2.5$, $\beta = 4$ and $\gamma = 0.15$. For these
values, the improvement in perplexity for unseen bigrams in a held-out 18
thousand word sample, in which 10.6% of the bigrams are unseen, is just over
20%. This improvement on unseen bigrams corresponds to an overall test set
perplexity improvement of 2.4% (from 237.4 to 231.7). Table 1 shows reductions
in training and test perplexity, sorted by training reduction, for different
choices in the number $k$ of closest neighbors used. The values of $\beta$,
$\gamma$ and $t$ are the best ones found for each $k$.
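
The relation between the improvement on unseen bigrams and the overall
improvement can be checked with a short calculation. The sketch below rests on
assumptions that are not spelled out in the text: that the 10.6% figure counts
bigram tokens in the test sample, that estimates for seen bigrams are unchanged,
and that "just over 20%" corresponds to the 20.51% test reduction listed first
in Table 1.

```python
def overall_perplexity_ratio(unseen_fraction, unseen_reduction):
    """Ratio of new to old overall perplexity.

    Perplexity is a geometric mean of inverse probabilities, so scaling the
    perplexity of a fraction f of the test bigrams by (1 - r) scales the
    overall perplexity by (1 - r) ** f.
    """
    return (1.0 - unseen_reduction) ** unseen_fraction

ratio = overall_perplexity_ratio(0.106, 0.2051)
new_perplexity = 237.4 * ratio
print(f"overall reduction: {100 * (1 - ratio):.1f}%")
print(f"new perplexity: {new_perplexity:.1f}")
```

Under these assumptions the calculation lands very close to the reported
overall figures, which shows how the small share of unseen bigrams dilutes a
large local improvement.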



From equation (6), it is clear that the computational cost of applying the
similarity model to an unseen bigram is $O(k)$. Therefore, lower values for $k$
(and also for $t$) are computationally preferable. From the table, we can see
that reducing $k$ to 30 incurs a penalty of less than 1% in the perplexity
improvement, so relatively low values of $k$ appear to be sufficient to achieve
most of the benefit of the similarity model. As the table also shows, the best
value of $\gamma$ increases as $k$ decreases, that is, for lower $k$ a greater
weight is given to the conditioned word's frequency. This suggests that the
predictive power of neighbors beyond the closest 30 or so can be modeled fairly
well by the overall frequency of the conditioned word.

The bigram similarity model was also tested as a language model in speech
recognition. The test data for this experiment were pruned word lattices for 403
WSJ closed-vocabulary test sentences. Arc scores in those lattices are sums of
an acoustic score (negative log likelihood) and a language-model score, in this
case the negative log probability provided by the baseline bigram model.

3 Values of $\beta$ and $t$ refer to base 10 logarithms and exponentials in all
calculations.

B   commitments . . . from leaders felt the three point six billion dollars
S   commitments . . . from leaders fell to three point six billion dollars
B   followed by France the US agreed in Italy
S   followed by France the US Greece . . . Italy
B   he whispers to made a
S   he whispers to an aide
B   the necessity for change exist
S   the necessity for change exists
B   without . . . additional reserves Centrust would have reported
S   without . . . additional reserves of Centrust would have reported
B   in the darkness past the church
S   in the darkness passed the church

Table 2: Speech Recognition Disagreements between Models

From the given lattices, we constructed new lattices in which the arc scores 
were modified to use the similarity model instead of the baseline model. We 
compared the best sentence hypothesis in each original lattice and in the modi- 
fied one, and counted the word disagreements in which one of the hypotheses is 
correct. There were a total of 96 such disagreements. The similarity model was 
correct in 64 cases, and the back-off model in 32. This advantage for the simi- 
larity model is statistically significant at the 0.01 level. The overall reduction in 
error rate is small, from 21.4% to 20.9%, because the number of disagreements 
is small compared with the overall number of errors in our current recognition 
setup. 
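
The text does not say which significance test was used. One natural choice for
paired disagreements of this kind is a two-sided sign test, sketched below as an
assumption rather than as the authors' procedure.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided sign test under the null of equally likely outcomes."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of k or more successes out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

p_value = sign_test_p_value(64, 32)
print(p_value, p_value < 0.01)
```

With 64 of the 96 disagreements going to the similarity model, this test gives
a p-value well below the 0.01 level quoted above.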

Table 2 shows some examples of speech recognition disagreements between
the two models. The hypotheses are labeled 'B' for back-off and 'S' for similarity,
and the bold-face words are errors. The similarity model seems to be better able
to model regularities such as semantic parallelism in lists and avoiding a past
tense form after "to." On the other hand, the similarity model makes several
mistakes in which a function word is inserted in a place where punctuation
would be found in written text.

5 Related Work



The cooccurrence smoothing technique (Essen and Steinbiss, 1992), based on 
earlier stochastic speech modeling work by Sugawara et al. (1985), is the main 
previous attempt to use similarity to estimate the probability of unseen events 
in language modeling. In addition to its original use in language modeling
for speech recognition, Grishman and Sterling (1993) applied the cooccurrence
smoothing technique to estimate the likelihood of selectional patterns. We will
outline here the main parallels and differences between our method and cooc- 
currence smoothing. A more detailed analysis would require an empirical com- 
parison of the two methods on the same corpus and task. 

In cooccurrence smoothing, as in our method, a baseline model is combined 
with a similarity-based model that refines some of its probability estimates. 
The similarity model in cooccurrence smoothing is based on the intuition that
the similarity between two words $w$ and $w'$ can be measured by the confusion
probability $P_C(w'|w)$ that $w'$ can be substituted for $w$ in an arbitrary
context in the training corpus. Given a baseline probability model $P$, which is
taken to be the MLE, the confusion probability $P_C(w_1'|w_1)$ between
conditioning words $w_1'$ and $w_1$ is defined as

$$P_C(w_1'|w_1) = \frac{1}{P(w_1)} \sum_{w_2} P(w_1|w_2) P(w_1'|w_2) P(w_2), \tag{9}$$

the probability that $w_1$ is followed by the same context words as $w_1'$. Then
the bigram estimate derived by cooccurrence smoothing is given by

$$P_S(w_2|w_1) = \sum_{w_1'} P(w_2|w_1') P_C(w_1'|w_1). \tag{10}$$

Notice that this formula has the same form as our similarity model (6), except
that it uses confusion probabilities where we use normalized weights.⁴ In
addition, we restrict the summation to sufficiently similar words, whereas the
cooccurrence smoothing method sums over all words in the lexicon.
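
The confusion probability defined above can be computed directly from a joint
bigram distribution. The sketch below is illustrative only: the toy joint
distribution and all names are hypothetical, and the marginals are derived from
that joint table rather than from a separate unigram model.

```python
def confusion_model(joint):
    """Build P_C(w1'|w1) from joint[w1][w2], the joint bigram probability."""
    p1 = {w1: sum(row.values()) for w1, row in joint.items()}  # P(w1)
    p2 = {}                                                    # P(w2)
    for row in joint.values():
        for w2, p in row.items():
            p2[w2] = p2.get(w2, 0.0) + p

    def p_c(w1_prime, w1):
        # P(w1|w2) P(w1'|w2) P(w2) simplifies to joint * joint / P(w2).
        s = sum(joint[w1].get(w2, 0.0) * joint[w1_prime].get(w2, 0.0) / p2[w2]
                for w2 in p2)
        return s / p1[w1]

    return p_c, p1

# Toy joint distribution over (w1, w2) pairs (hypothetical, sums to 1).
joint = {
    "eat": {"peach": 0.3, "pear": 0.1},
    "devour": {"peach": 0.2, "pear": 0.1},
    "visit": {"beach": 0.3},
}
p_c, p1 = confusion_model(joint)
print(sum(p_c(v, "eat") for v in joint))  # close to 1.0: a distribution
print(p_c("devour", "eat"), p_c("visit", "eat"))
```

Words that never share a following word with the conditioning word, such as
"visit" here, receive zero confusion probability, while every word with some
shared context contributes to the smoothed estimate.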

The similarity measure (9) is symmetric in the sense that $P_C(w'|w)$ and
$P_C(w|w')$ are identical up to frequency normalization, that is
$\frac{P_C(w'|w)}{P_C(w|w')} = \frac{P(w')}{P(w)}$. In contrast, $D(w \| w')$
(7) is asymmetric in that it weighs each context in proportion to its
probability of occurrence with $w$, but not with $w'$. In this way, if $w$ and
$w'$ have comparable frequencies but $w'$ has a sharper context distribution
than $w$, then $D(w' \| w)$ is greater than $D(w \| w')$. Therefore, in our
similarity model $w'$ will play a stronger role in estimating $w$ than vice
versa. These properties motivated our choice of relative entropy as the
similarity measure, because of the intuition that words with sharper
distributions are more informative about other words than words with flat
distributions.

4 This presentation corresponds to model 2-B in Essen and Steinbiss (1992).
Their presentation follows the equivalent model 1-A, which averages over similar
conditioned words, with the similarity defined with the preceding word as
context. In fact, these equivalent models are symmetric in their treatment of
conditioning and conditioned word, as they can both be rewritten as

$$P_S(w_2|w_1) = \sum_{w_1', w_2'} P(w_2|w_1') P(w_1'|w_2') P(w_2'|w_1).$$

They also consider other definitions of confusion probability and smoothed
probability estimate, but the one above yielded the best experimental results.

Finally, while we have used our similarity model only for missing bigrams
in a back-off scheme, Essen and Steinbiss (1992) used linear interpolation for
all bigrams to combine the cooccurrence smoothing model with MLE models of
bigrams and unigrams. Notice, however, that the choice of back-off or
interpolation is independent of the similarity model used.



6 Further Research 

Our model provides a basic scheme for probabilistic similarity-based estimation 
that can be developed in several directions. First, variations of (6) may be tried,
such as different similarity metrics and different weighting schemes. Also, some 
simplification of the current model parameters may be possible, especially with 
respect to the parameters t and k used to select the nearest neighbors of a word. 
A more substantial variation would be to base the model on similarity between 
conditioned words rather than on similarity between conditioning words. 

Other evidence may be combined with the similarity-based estimate. For 
instance, it may be advantageous to weigh those estimates by some measure of 
the reliability of the similarity metric and of the neighbor distributions. A
second possibility is to take into account negative evidence: if $w_1$ is
frequent, but $w_2$ never followed it, there may be enough statistical evidence
to put an upper bound on the estimate of $P(w_2|w_1)$. This may require an
adjustment of the similarity-based estimate, possibly along the lines of
(Rosenfeld and Huang, 1992).
Third, the similarity-based estimate can be used to smooth the maximum like- 
lihood estimate for small nonzero frequencies. If the similarity-based estimate 
is relatively high, a bigram would receive a higher estimate than predicted by 
the uniform discounting method. 

Finally, the similarity-based model may be applied to configurations other 
than bigrams. For trigrams, it is necessary to measure similarity between
different conditioning bigrams. This can be done directly, by measuring the
distance between distributions of the form $P(w_3|w_1, w_2)$, corresponding to
different bigrams $(w_1, w_2)$. Alternatively, and more practically, it would be
possible to define a similarity measure between bigrams as a function of
similarities between corresponding words in them. Other types of conditional
cooccurrence probabilities have been used in probabilistic parsing (Black et
al., 1993). If the configuration in question includes only two words, such as
$P(\mathit{object}|\mathit{verb})$, then it is possible to use the model we have
used for bigrams. If the configuration includes more elements, it is necessary
to adjust the method, along the lines discussed above for trigrams.



7 Conclusions 

Similarity-based models suggest an appealing approach for dealing with data 
sparseness. Based on corpus statistics, they provide analogies between words 
that often agree with our linguistic and domain intuitions. In this paper we 
presented a new model that implements the similarity-based approach to provide 
estimates for the conditional probabilities of unseen word cooccurrences. 

Our method combines similarity-based estimates with Katz's back-off scheme, 
which is widely used for language modeling in speech recognition. Although the 
scheme was originally proposed as a preferred way of implementing the inde- 
pendence assumption, we suggest that it is also appropriate for implementing 
similarity-based models, as well as class-based models. It enables us to rely on 
direct maximum likelihood estimates when reliable statistics are available, and 
only otherwise resort to the estimates of an "indirect" model. 

The improvement we achieved for a bigram model is statistically significant, 
though modest in its overall effect because of the small proportion of unseen 
events. While we have used bigrams as an easily-accessible platform to develop 
and test the model, more substantial improvements might be obtainable for 
more informative configurations. An obvious case is that of trigrams, for which 
the sparse data problem is much more severe. Our longer-term goal, however,
is to apply similarity techniques to linguistically motivated word cooccurrence
configurations, as suggested by lexicalized approaches to parsing (Schabes,
1992; Lafferty, Sleator, and Temperley, 1992). In configurations like
verb-object and adjective-noun, there is some evidence (Pereira, Tishby, and
Lee, 1993) that sharper word cooccurrence distributions are obtainable, leading
to improved predictions by similarity techniques.